Creating General-Purpose Corpora Using Automated Search Engine Queries
Author
Abstract
The Internet is a natural source of linguistic data, providing an abundance of texts of various types in a large number of languages. These texts are already in electronic form suitable for corpus studies, either as downloadable pages or as a resource to be searched using search engines. On the other hand, large representative corpora of the size of the British National Corpus (BNC, Aston and Burnard 1998) exist for very few languages, because they are expensive to build. They are absent even for major world languages, such as Chinese or French. Many ad hoc text collections are available, but they are restricted in either their size or the variety of text types. Typically they are produced on the basis of out-of-copyright fiction (such as Project Gutenberg) or newswire/newspaper texts that are available in large quantities and relatively easy to acquire from their publishers (e.g., the Reuters corpus for English (Rose et al. 2002), or the Gigaword corpora for Arabic, Chinese and English (Cieri and Liberman 2002)). News corpora are useful for many applications, such as the development of gazetteers, parsing and word sense disambiguation algorithms, yet they cannot replace corpora representative of general language, such as the BNC, as
Similar resources
Evaluation of Web-based Corpora: Effects of Seed Selection and Time Interval
Recently, there have been efforts to construct written corpora by using the WWW. A promising approach to building Web corpora is to run automated queries to search engines and download the pages found in this way. This makes it possible to build corpora rapidly and economically, but we cannot control what is contained in the resulting corpora. Under these circumstances, it is important to verify the gene...
Automated Construction and Evaluation of Japanese Web-based Reference Corpora
A particularly promising approach to the use of the Web for linguistic research is to build corpora via automated queries to search engines, retrieving and post-processing the pages found in this way (Ghani et al. 2003, Baroni and Bernardini 2004, Sharoff to appear). This approach differs from the traditional method of corpus construction, where one needs to spend considerable time finding and ...
Measuring Web-Corpus Randomness: A Progress Report
The Web allows fast and inexpensive construction of general purpose corpora, i.e., corpora that are not meant to represent a specific sublanguage, but a language as a whole, and thus should be unbiased with respect to domains and genres. In this paper, we present an automated, quantitative, knowledge-poor method to evaluate the randomness (with respect to a number of non-random partitions) of a...
Creating Multilingual Translation Lexicons with Regional Variations Using Web Corpora
The purpose of this paper is to automatically create multilingual translation lexicons with regional variations. We propose a transitive translation approach to determine translation variations across languages that have insufficient corpora for translation via the mining of bilingual search-result pages and clues of geographic information obtained from Web search engines. The experimental resu...
Towards Supporting Exploratory Search over the Arabic Web Content: The Case of ArabXplore
Due to the huge amount of data published on the Web, the Web search process has become more difficult, and it is sometimes hard to get the expected results, especially when users are less certain about their information needs. Several approaches have been proposed to support exploratory search on the Web by using query expansion, faceted search, or supplementary information extracted from exte...